In this lecture, we’ll focus on visualising large numbers of continuous variables.
One simple extension to a method we’ve seen before is to include a third numerical variable in a scatterplot. One way to achieve this is the bubble plot, where the value of a third variable controls the size of the points we draw on a scatterplot:
library(plotly)
plot_ly(iris, x=~Petal.Width, y=~Petal.Length, size=~Sepal.Width, color=~Species,
sizes = c(1,50), type='scatter', mode='markers', marker = list(opacity = 0.7, sizemode = "diameter"))
Here we have positioned the points of the iris data
according to the Petal measurements and used the size of the point to
indicate the Sepal.Width. We can note that the orange
points (versicolor) appear to have more small bubbles than the others,
and the green points (setosa) look to have larger values.
This technique can be quite effective when the variable corresponding to point size corresponds to some measure of importance, scale, or size (such as population per country, number of samples in a survey, number of patients in a medical trial). While it can be effectively used in three-variable problems, for more variables than this we require other methods.
Dealing with more continuous variables will require scaling up a lot of the familiar techniques from earlier. Individual numerical summaries can still be computed, though it is even more difficult to easily compare. Some of the standard visualisations can be easily applied to many variables, such as histograms and box plots.
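For instance, side-by-side box plots of all the variables give a quick overview of many univariate distributions at once. A minimal sketch using the iris data (scale() puts the variables on a common scale so the boxes are comparable):

```r
## Standardise the four numeric iris measurements, then draw one box
## plot per column of the resulting matrix on a shared vertical axis.
z <- scale(iris[, 1:4])
boxplot(z, main = "Standardised iris measurements")
```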
One particularly useful technique is to take scatterplots and draw many of them in a matrix, letting us compare multiple variables at once. A scatterplot matrix is an array of scatterplots in which each variable is plotted against all of the others.
Like the grid or trellis scatterplots seen earlier, we produce an array of plots, but now we plot all variables simultaneously!
These give excellent initial overviews of the relationship between continuous variables in data sets with relatively small numbers of variables.
Let’s use a data set on Swiss banknotes to illustrate variations we can make to a scatterplot matrix.
The data are six measurements on each of 100 genuine and 100 forged Swiss banknotes. The idea is to see if we can distinguish the real notes from the fakes on the basis of the notes’ dimensions alone:
library(car)
data(bank,package="gclus")
scatterplotMatrix(bank[,-1], smooth=FALSE, regLine=FALSE, diagonal=TRUE, groups=bank$Status,
col=c(5,7), pch=c(16,16))
Use data(airquality) to load the built-in airquality data set, then use the scatterplotMatrix() function in the package car to produce a scatterplot matrix of the four numeric variables in the data set. The plot with default options looks like this:
data(airquality)
scatterplotMatrix(airquality[,-c(5,6)])
We can make this nicer by excluding some of the features and formatting the points:
scatterplotMatrix(airquality[,-c(5,6)], smooth=FALSE, regLine=FALSE, pch=16)
If we want histograms on the diagonal, we can use the diagonal option.
scatterplotMatrix(airquality[,-c(5,6)], smooth=FALSE, regLine=FALSE, pch=16, diagonal=list(method ="histogram"))
Parallel coordinate plots (PCPs) have become a popular tool for highly-multivariate data.
Rather than using perpendicular axes for pairs of variables and points for data, we draw all variables on parallel vertical axes and connect data with lines.
The variables are standardised, so that a common vertical axis makes sense.
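To see what this standardisation and line-drawing amounts to, here is a hand-rolled sketch of a PCP using only base R: standardise each variable with scale(), then draw one line per observation across the four vertical axes with matplot().

```r
## Each column of t(z) is one observation; matplot() draws it as a
## line over x = 1:4, one vertical axis per variable.
z <- scale(iris[, 1:4])
matplot(t(z), type = "l", lty = 1, xaxt = "n",
        col = c('#f8766d','#00ba38','#619cff')[iris$Species],
        xlab = "", ylab = "Standardised value")
axis(1, at = 1:4, labels = colnames(z))
```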
par(mfrow=c(1,2))
pairs(iris[,4:1],pch=16,col=c('#f8766d','#00ba38','#619cff')[iris$Species])
library(lattice)
parallelplot(iris[,1:4], col=c('#f8766d','#00ba38','#619cff')[iris$Species], horizontal=FALSE)
In general:

* Each line represents one data point.
* The height of each line above the variable label indicates the (relative) size of the value for that data point. The variables are transformed to a common scale to make comparison possible.
* All the values for one observation are connected across the plot, so we can see how things change for individuals as we move from one variable to another. Note that this is sensitive to the ordering we choose for the variables in the plot – some orderings of the variables may give clearer pictures than others.
For these data:

* We can see how the setosa species (red) is separated from the others on the Petal measurements in the PCP by the separation between the group of red lines and the rest.
* Setosa (red) is also generally smaller than the others, except on Sepal Width where it is larger than the other species.
* We can also pick out outliers, for instance one setosa iris has a particularly small value of Sepal Width compared to all the others.
Using parallel coordinate plots:

* Reading and interpreting a parallel coordinate plot (PCP) is a bit more difficult than a scatterplot, but can be just as effective for bigger problems.
* When we identify interesting features, we can then investigate them using more familiar graphics.
* A PCP gives a quick overview of the univariate distribution for each variable, and so we can identify skewness, outliers, gaps and concentrations.
* However, for pairwise properties and associations it’s best to draw a scatterplot.
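Since a PCP is sensitive to variable ordering, a quick way to experiment is simply to reorder the columns before plotting. A sketch with the iris data, putting the Petal measurements first (this particular ordering is illustrative, not canonical):

```r
library(lattice)
## Reordering the columns changes which adjacent-variable comparisons
## the eye can follow; here Petal measurements come before Sepal ones.
parallelplot(iris[, c(3, 4, 1, 2)],
             col = c('#f8766d','#00ba38','#619cff')[iris$Species],
             horizontal = FALSE)
```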
The Guardian newspaper in the UK publishes a ranking of British universities each year; it reported these data in May 2012 as a guide for 2013. 120 universities are ranked using a combination of 8 criteria, combined into an ‘Average Teaching Score’ used to form the ranking.
library(GDAdata)
data(uniranks)
## a little bit of data tidying
names(uniranks)[c(5,6,8,10,11,13)] <- c('AvTeach','NSSTeach','SpendPerSt','Careers',
'VAddScore','NSSFeedb')
uniranks1 <- within(uniranks, StaffStu <- 1/(StudentStaffRatio))
## draw the scatterplot matrix
pairs(uniranks1[,c(5:8,10:14)])
We can see some obvious patterns and dependencies here. What about the
parallel coordinate plot?
parallelplot(uniranks1[,c(5:8,10:14)], col='black', horizontal=FALSE)
Can we learn anything from this crazy mess of spaghetti?
PCPs are most effective if we colour the lines to represent subgroups of the data. We can colour the Russell Group universities, which unsurprisingly are usually at the top (except on NSSFeedback!)
## create a new variable to represent the group we want
uniranks2 <- within(uniranks1,
Rus <- factor(ifelse(UniGroup=="Russell", "Russell", "not")))
parallelplot(uniranks2[,c(5:8,10:14)], col=c("#2297E6","#DF536B")[uniranks2$Rus], horizontal=FALSE)
Using colour helps us to differentiate the lines for the different data points more clearly. Now we can start to explore for features and patterns.
A scatterplot matrix corroborates most of these features, though we’re probably near the limit of what we can read and extract from such a plot!
pairs(uniranks2[,c(5:8,10:14)],col=c("#2297E6","#DF536B")[uniranks2$Rus], pch=16)
The Boston housing data contains 1 binary, 1 discrete, and 12 numeric variables. Looking first at the categorical variables:
data(Boston, package='MASS')
par(mfrow=c(1,2))
for(i in c("chas","rad")){
barplot(table(Boston[,i]), main=paste("Barchart of",i),col='#72abe3')
}
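The exact counts behind these bar charts are worth checking numerically; table() tallies each distinct value, which makes gaps easy to confirm:

```r
## Tabulate the discrete variable rad: any value of rad that never
## occurs simply does not appear as a column in the table.
data(Boston, package = 'MASS')
table(Boston$rad)
```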
We note the following features which should probably be explored further:
* chas has relatively few 1 values in the data.
* For rad, the values of 4, 5, and 24 seem to dominate, and there’s a big gap between 8 and 24 with no observations.

Turning now to histograms of the continuous variables:
## select only the continuous variables
cts <- !(names(Boston) %in% c("chas","rad"))
par(mfrow=c(3,4))
for(i in names(Boston)[cts]){
hist(Boston[,i], col='#72abe3', xlab='', ylab='', main=paste("Histogram of",i))
}
Here we identify the following features worthy of a closer look:
* crim, zn and black all exhibit strong skewness.
* indus and tax display strange spikes at certain values, with noticeable gaps in the distribution with no observed values.

(Again we used a for loop to iterate over the continuous variables.)

Looking now at a scatterplot matrix of the continuous variables:
library(scales)
pairs(Boston[,cts],pch=16,col=alpha('black',0.2))
We can pick out several groups of variables with visible pairwise associations:

* crim and zn;
* nox, rm, age, and dis in the centre of the plot;
* tax, ptratio, black and indus.

If we treat medv as a response variable, then there are further dependencies with other variables, as shown in the bottom row/rightmost column. We can quantify these associations with the correlation matrix:

round(cor(Boston),2) ## round to 2dp
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
library(corrplot)
corrplot(cor(Boston))
There are clearly strong associations between
age and
dis, indus and dis, and
lstat and medv that may be worth studying
closer. chas appears almost uncorrelated to all the other
variables – but remember chas was a binary variable, and so
the correlation coefficient is meaningless here and we should not draw
conclusions from this feature!
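For a binary variable like chas, a grouped box plot of a continuous variable is a more informative way to examine association than a correlation coefficient. A sketch (the axis labels are our own wording; chas is the 0/1 Charles River indicator):

```r
## Compare the distribution of medv for tracts off and on the river.
data(Boston, package = 'MASS')
boxplot(medv ~ chas, data = Boston,
        names = c("not on river", "on river"),
        ylab = "medv")
```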
library(gplots)
heatmap.2(as.matrix(Boston), scale='column',trace='none')
Here we can see:

* The lower quarter of the heat map shows a lot of pale orange/yellow columns – the data points here typically take low values, and do so on multiple variables.
* The second column from the right appears to be roughly constant for most of the data, apart from a cluster of cases with very high (red) values.
Use data <- iris[, 1:4] to extract the numerical values in the iris data set, standardise them with data <- scale(data), and then draw a heat map using the heatmap() function (no need to install any package). What do the colours represent?

# Load and preprocess data
data <- iris[, 1:4] # Select only numeric columns
data <- scale(data) # Normalize the data
# Adjust margins
par(mar = c(20, 8, 2, 2))
# Create heatmap. Options needed to visualise the full names of the variables.
heatmap(data, cexCol = 0.9, labCol = colnames(data))
Colours represent departure from the mean of the variable values: yellow = small values, red = large values. Numbers on the right are observations, ordered so that more similar observations are grouped together. The way this grouping is done is shown in the dendrogram on the left. The dendrogram on the top is similar, but for similarities between variables.
Here is a simpler version of the heatmap (for advanced graphics sometimes you need extra options just to get rid of the unwanted information!)
heatmap(data,
Colv = NA, Rowv = NA, # Remove dendrograms
col = heat.colors(256), # Use a nice color palette
main = "Simple Heatmap of Iris Data",
cexCol = 0.9, # Increase column label size
labCol = colnames(data))